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Abstract In this work we consider the problem of learning the structure of 
Markov networks from data. We present an approach for tackling this prob- 
lem called IBMAP, together with an efficient instantiation of the approach: the 
IBMAP-HC algorithm, designed for avoiding important limitations of existing 
independence-based algorithms. These algorithms proceed by performing statis- 
tical tests of independence on data, trusting completely the outcome of each test. 
In practice tests may be incorrect, resulting in potential cascading errors and the 
consequent reduction in the quality of the structures learned. IBMAP contemplates 
this uncertainty in the outcome of the tests through a probabilistic maximum-a- 
posteriori approach. The approach is instantiated in the IBMAP-HC algorithm, 
a structure selection strategy that performs a polynomial heuristic local search in 
the space of possible structures. We present an extensive empirical evaluation on 
synthetic and real data, showing that our algorithm outperforms significantly the 
existent independence-based algorithms, in terms of data efficiency and quality of 
learned structures, with equivalent computational complexities. We also show the 
performance of IBMAP-HC in a real-world application of knowledge discovery: 
EDAs, which are evolutive algorithms that use structure learning on each gen- 
eration for modeling the distribution of populations. The experiments show that 
when IBMAP-HC is used to learn the structure, EDAs improve the convergence 
to the optimum. 
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1 Introduction 

We present in this work the IBMAP (Independence-Based Maximum a Posteriori) 
approach for robust learning of Markov network structures from data, together 
with IBMAP-HC, an efficient hill-climbing instantiation of the approach. Markov 
networks, together with Bayesian networks, belong to the family of probabilistic 
graphical models [16] . a computational framework for compact representation of 
joint probability distributions. They consist of an undirected (Markov networks) 
or directed (Bayesian networks) graph G and a set of numerical parameters 6. 
Each node in G represents a random variable of the domain, and the edges encode 
conditional independences between them. Therefore, the graph G is called the 
independence structure of the distribution. The importance of these independences 
is that they factorize the joint distribution over the domain variables into factors 
over subsets of variables, resulting in important reductions in the space complexity 
required to store a distribution [32] . 

An interesting problem is to represent the distribution of historical information 
in a Markov network. This task can be done either by a human expert, or by 
learning it automatically from data. Naturally, the knowledge of experts is hard to 
obtain, and not always enough to design a proper Markov network. For this reason, 
algorithms for learning automatically such models are considered an increasingly 
important tool for knowledge discovery. The problem consists in two tasks, learning 
G and learning 8. In this work we focus in the problem of learning G, that has 
shown to be an NP-hard problem [TU] since the number of structures grows super- 
exponentially with the number of variables of the domain. 

The literature considers two broad approaches for learning G: score-based [131 
22,19,14, and independence-based (also known as constraint-based) algorithms 
32 : . The score-based approach is intractable in practice for large domains for two 
reasons: (i) it approaches the problem as an optimization on the space of G, and (ii) 
for each G during the search it requires a learning step of the model parameters 0, 
which involves an expensive inference step [34 . The independence-based approach 
proceeds by performing statistical tests of independence on the data, and based 
on the outcome of the tests discards all structures inconsistent with the test. This 
approach is efficient, and correct under assumptions, but in practice presents qual- 
ity problems: one of the assumptions is the correctness of independence assertions, 
which may not be true in practice for statistical tests when data is insufficient. 
These problems are described in detail in a recently published survey [29] ■ It is 
important to mention that both score-based and independence-based approaches 
have been motivated by distinct learning goals. Generally, score-based approaches 
are better suited for the density estimation goal, that is, tasks where inferences 
or predictions are required [23] ■ In contrast, independence-based methods are bet- 
ter suited for other learning goals, such as feature selection for classification, or 
knowledge discovery |321f4"l[5] . 

IBMAP follows the independence-based approach but relaxes the assumption 
of correctness of independence assertions. Instead of trusting the outcome of a 
statistical test on data, it considers explicitly the posterior probability of inde- 
pendences given the data. As explained in detail later on, these posteriors are 
combined into the posterior of the whole structure (given the data), deciding on 
the output structure following the well-known maximum-a-posteriori approach. 
This clearly circumvents the cascading error, as the true structure is no longer dis- 
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carded on an incorrect test, it only results in a lower posterior probability. With 
further tests, the posterior probability of the true structure may increase again. 

In order to evaluate the improvements in the quality of the structures pro- 
duced by our approach, we performed detailed and systematic experiments over 
synthetic datasets, real-world datasets. In all those cases we compared the struc- 
tural errors of the structures learned by IBMAP-HC and against those learned by 
representative, state-of-the-art competitors: GSMN [HE], and HHC-MN, a simple 
adaptation for Markov networks of an independence-based structure learning al- 
gorithm for Bayesian networks, called HHC [5]. We note that structural errors as 
quality measure is the most appropriate for knowledge discovery algorithms such 
as those using the independence-based approach. 

Additionally we tested the performance of IBMAP-HC in a real world applica- 
tion: Estimation of Distribution algorithms (ED As) [251117] . These algorithms are 
variations of the well-known evolutionary algorithms, that replace the crossover 
and mutation stages for generating a new population, by learning a probability 
distribution of the current population. This application is relevant because ED As 
solve problems that are known to be hard for traditional genetic algorithms. We 
tested IBMAP-HC in a state-of-the-art EDA algorithm based on Markov networks 
structure learning, called the MOA algorithm [31]. In our experiments, MOA im- 
proves its convergence to the optimum when IBMAP-HC is used to learn the 
structure. 

The rest of this work is organized as follows. Section [2] presents an overview 
of the independence-based learning approach and motivates our contribution. Sec- 
tion [3] presents IBMAP and the IBMAP-HC algorithm. Section H shows our ex- 
periments on synthetic and real datasets, and Section [5] shows our experiments 
on EDAs. Finally, Section [6] summarizes this work, and poses several possible 
directions of future work. 



2 Background 

This section provides some background on Markov networks, defines the problem 
of structure learning, and motives our independence-based approach. 

A Markov network representing an underlying distribution -P(V) over the set 
of n = |V| random variables V consists in an undirected graph G and a set of 
potential functions defined by a set of numerical parameters 0. The graph G is a 
map of the independences in -P(V), and such independences can be read from the 
graph through vertex separation, considering that each variable is conditionally 
independent of all its non-neighbor variables in the graph, given the set of its 
neighbor variables [26] . 

The structure G of -P(V) can be factorized into a product of potential functions 
4>c{V c ) over the completely connected sub-graphs (a.k.a., cliques) V c of G, that is, 

p(Y) = \ II MVc), 

c£cliques(G) 

where Z is the partition function, a constant that normalizes the product of po- 
tentials. Such potential functions are parameterized by the set of numerical pa- 
rameters 0. 
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The problem of structure learning takes as input a dataset D, which is assumed 
to be a representative sample of the underlying distribution P (V). Commonly, D 
is structured in a tabular format, with one column per random variable in the 
domain V, and one row per data point. The optimal solution of the problem is a 
perfect-map of -P(V) [26], that is, a structure that encodes all the dependences and 
the independences present in -P(V). The closer to a perfect-map, the better is the 
structure learned, and the better is the resulting Markov network for representing 

P(V)- ' 

Independence-based algorithms learn a perfect-map by performing a succes- 
sion of statistical independence tests, discarding at each iteration all structures 
inconsistent with the outcome of the test, and deciding on the tests to perform 
next based on the outcomes learned so far. 

A statistical independence test is a statistics computed from D which tests if 
two random variables X and Y are conditionally independent, given some condi- 
tioning set of variables Z. This independence assertion is denoted (X_LLY|Z) (or 
(X \\LY\Z) for dependence assertions) . The computational cost of a test is propor- 
tional to the number of rows in D, and the number of variables involved in the 
test. Examples of independence tests used in practice are Mutual Information [11] , 
Pearson's x 2 and G 2 [2], the Bayesian test [20], and for continuous Gaussian data 
the partial correlation test [32] . 

There are several advantages of independence-based algorithms. First, they can 
learn the structure without interleaving the expensive task of parameters estima- 
tion (contrary to score-based algorithms), reaching sometimes polynomial com- 
plexities in the number of statistical tests performed. If the complete model is 
required, the parameters can be estimated only once for the learned structure. 
Another important advantage of such algorithms is that they are guaranteed to 
learn the structure of the underlying distribution, as long as the following assump- 
tions hold: a) graph-isomorphism (i.e., the independences in the distribution can 
be encoded in an undirected graph) , b) the underlying distribution is strictly pos- 
itive (i.e., -P(V) > 0, for every assignment of V), and c) the outcomes of tests are 
correct (i.e., the independencies learned are true in P(V)). 

Unfortunately, the third assumption is rarely true in practice, as the number of 
contingency tables for which a statistics has to be computed grows exponentially 
with the number of variables in the conditioning set of the test. Therefore, the ef- 
fective dataset from which the statistic is computed decreases exponentially in size, 
thus degrading exponentially the quality of the statistics. When tests outcome in- 
correct independences, independence-based algorithms produce what is commonly 
called cascade errors [32], that not only discard the true underlying structure, but 
further confuse the algorithm in the test to perform next. Our approach tackles this 
main issue of independence-based algorithms by contemplating the uncertainty in 
the outcome of the tests through a probabilistic maximum-a-posteriori approach. 

3 The independence-based MAP approach 

We describe now the main contribution of this work: the IBMAP approach for 
Markov networks structure learning. The central idea is to aggregate the result 
of many statistical tests of conditional independence into the posterior probabil- 
ity P{G | D) of the independence structure G, and, by taking the maximum-a- 
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posteriori (MAP) model selection approach, selecting as output the structure that 
maximizes this posterior. Formally 

G* = argmaxPr(G | D). (1) 

G 

We proceed now to discuss how the individual posteriors Pr(G | D) are com- 
puted, and later in Section 13. II how to perform the MAP optimization efficiently. 
We start by replacing G in Pr(G | D) by the closure C(G), an equivalent repre- 
sentation of G consisting of a set of independence assertions that determines it. 
Formally 

Definition 1 (Closure) The closure of an undirected independence structure G 
is a set of conditional independence assertions, C(G) = {c{\, that are sufficient for 
completely determining the structure G of a positive distribution. 

We thus have that 

Pr(G | D) = Pr(C(G) | D). 
Applying the chain rule over the independence assertions in C(G) we obtain: 

\C{G)\ 

Pr(C(G)\D)= Pr( Cl |ci,..., Cl _i,^). (2) 

i = l 

To the best of the author's knowledge, there is no existing method for com- 
puting exactly the probabilities Pr(ci|ci, . . . , c%—i, D) of independence assertions 
conditioned on other independence assertions and data. A common approxima- 
tion is to assume that all the independence assertions in the closure are mutu- 
ally independent. This assumption is made implicitly by all the Markov networks 
independence-based algorithms [29_ , because the statistical tests are used as a black 
box, only using data for deciding independence for each assertion Cj. Applying the 
approximation to Eq. ([2]) we obtain 

Pr(C(G)\D)^l[Pr(c t \D) 

i 

that expressed in terms of logarithms to avoid underflow results in a computable 
expression that we call the IB-score: 

a(G) = ^logPr(c J | J D), (3) 

i 

where each term logPr(ci | D) can be computed using the Bayesian test of condi- 
tional independence [20] . 

In summary, the IBMAP approach consists in the following maximization: 

G* « argmax a(G). (4) 

G 

3.1 The IBMAP-HC algorithm 

This section present our structure learning algorithm IBMAP-HC, an efficient 
instantiation of the IBMAP approach of Eq. Q that uses a closure based on 
Markov blankets, and a heuristic hill-climbing search to find the MAP structure 
G* of Eq. ©. 
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3.1.1 Markov Blanket Closure 

The Markov blanket closure is the closure set used by IBMAP-HC, consisting of 
a polynomial number of independence assertions (in the number of nodes of the 
graph). This closure is defined on the concept of Markov blanket Bx C V \ {X} 
of a variable X £ V [16] , defined as the set of all the nodes connected to X by an 
edge [26J, i.e., its adjacency set. 

Definition 2 (Markov blanket closure) The Markov blanket closure of a struc- 
ture G is a set of independence assertions determined by the union of the Markov 
blanket closure Cx(G) of each variable X in the domain V, i.e., 

C{G) = U C X (G), (5) 
xev 

where each Cx(G) is the union of two mutually exclusive sets of independence 
assertions: 

C X (G) = { {X4LY\B X \{Y}) : Y eB x } U 

{ {XALY\B X ) : Y£B X }, (6) 

that is, for each neighbor Y of X add a dependence assertion between both vari- 
ables conditioned on the blanket of X, minus {Y}, and for each Y not a neighbor 
of X add an independence assertion between both variables conditioned on the 
blanket of X. 

Appendix [X] presents a detailed proof that the Markov Blanket Closure is 
indeed a closure, that is, it completely determines the structure G used to construct 
it. 

An interesting aspect of the Markov blanket closure proposed is that it re- 
sults in a useful decomposition of the IB-score of Eq. ([3|) , with important positive 
consequences in the efficiency of its computation: 



CT ( G ) = E E lo S Pr ( Cl | D). (7) 
xev i 

By Eq. the second summation in Eq. (JJJ has (n — 1) terms, one term per 
Y € V\{X}. Each of these terms is denoted hereon by o~x,y{G), and is computed 



_ f logPr«X4IY|B x > | D) if (X,Y) is an edge in G 
o-x,Y{^) |iogPr((X_U_y|B x ) | D) otherwise ' [ > 

The decomposition of Eq. ([7]) can also be seen as a decomposition over vari- 
ables, by grouping the second summation into the concept of variable score o~x{G). 
With this notation, Eq. (UJ) can be reformulated as 



a(G)= £ a x (G)= E ^,y(G). 

X6V XgV YeV\{X} 



(9) 
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This decomposition allows an incremental computation of the score a(G ) of 
some structure G' based on a previous computation of the score a(G) of a structure 
G, whenever G and G' differ by a constant number k of edges. For example, when 
G and G' differ by a single edge (X, Y), only the blankets of X and Y are affected, 
and therefore only independence assertions in Cx(G) and Cy{G) involving these 
blankets are modified. Because of the decomposability of the score, only variable 
scores ax{G) and ov(G) must be recomputed, with the possibility to reuse the 
(n — 2) variable scores aw(G), for all W 6 ~V\{X, Y}. This decomposition reduces 
the cost of computing the score for a structure G' from 0(n 2 ) to 0{n) tests, 
whenever the score of a neighbor structure G is already computed. For k edges 
differing between two structures at most 2k blankets are affected, with at most 
2k variable scores requiring re-computation, again, a cost of 0(n) tests. Such 
incremental computation of the score has an important impact in local search 
optimization algorithms, such as the IBMAP-HC algorithm described in the next 
section, that proceeds by exploring successively structures that differs in one edge. 

3.1.2 Our structure selection technique 

To conclude the presentation of IBMAP-HC, we present a specific structure selec- 
tion technique to perform the MAP search. The idea is to maximize the IB-score by 
a heuristic hill-climbing search in the space of structures, described in Algorithm[TJ 



Algorithm 1 IBMAP-HC (dataset D) 

1: G <— empty structure with n nodes (columns in D) 

2: current-score <— cr(G) 

3: repeat 

4: G' <— select-next-structure(G, c(G)) 

5: neighbor-score <— <r(G') 

6: if neighbor-score < current-score then 

7: return G 



function select-next- structure(G, cr(G)) 
11: (X*,y*)<- argmin u x .y{G) + cr Y ,x(G) 
(X,Y)eG 

12: return G with (X*,Y*) flipped 



The algorithm has as input parameter the dataset D, used for computing the 
statistical independence tests. The search starts by computing the IB-score of a 
structure G with n nodes (number of columns in D) and no edges (lines [1] and [2]), 
and then, in the main loop of line [3] iterates by selecting as the next structure 
G in the search, the neighbor of G with maximum score (function select-next- 
structure called at line U} . In this algorithm we consider as neighbor structures 
all those structures differing by exactly one edge. The algorithm stops when the 
neighbor proposed does not improve the score, a condition checked at line[6j If the 
termination criteria is not reached, the variables G and current-score are assigned 
by the new structure and its score, which means that an ascent was made. 



8: 
9: 
10: 



else 



G <- G' 

current-score 4— neighbor-score 
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The optimal neighbor structure is obtained by the function select-next-structure. 
A naive procedure would iterate over all (™) neighbors, with a cost of 0(n 2 ) tests 
for computing the score for each, and a total cost of 0(n 4 ) tests. Instead, we pro- 
pose here a heuristics that estimates the optimal neighbor without a single test 
computation, i.e., a cost of 0(1) test computations. Once the neighbor G' is se- 
lected, its score has to be computed (line [5]) for comparing it against the score of 
G. This score can be computed incrementally, resulting in a total computational 
cost per iteration of 0(n) tests, a clear reduction from the 0(n 4 ) of the naive 
procedure. 

We proceed now to explain and justify the heuristics for selecting the optimal 
neighbor in the select-next-structure function. Being the neighbors of G all those 
structures Gx,y differing with G by an edge (X, Y), there is a one-to-one corre- 
spondence between each pair (X, Y) and each neighbor Gx,y- The operation flip 
in line 1121 consist in adding an edge to G if it does not exist, or remove it other- 
wise. Also, by Eq. ([9]) and the discussion of incremental computation of the score 
at the end of the previous section, each of these pairs contributes to the score of G 
independently of the others by ax + ay , and this contribution is exactly the dif- 
ference between the scores of G and Gx,y- Therefore, the pair (X* ,Y*) with the 
smallest contribution in G, results in a corresponding neighbor G' = Gx* ,y* with 
the highest score among all neighbors. Performing the computation of ax + cry 
to get G' is equivalent to computing the score of G' incrementally with a cost 
of 0{n) tests, resulting in an overall cost of select-next- structure of 0(n 3 ) tests. 
The heuristics proposes an approximation that further reduces this computation, 
requiring not a single test computation. 

The heuristics consists in approximating ax(G) as ax,y{G), ignoring all other 
terms ax,w(G), W G V \ {X, Y}; and a similar approximation of ay(G) as 
ay,x(G). This heuristics is inspired by the fact that \ax,y(G) — <tx,y(Gx,y)\ 3* 
Wx,w{G) — ctw,y{Gx,y)\, for every W. Let us justify this. Since the edge (X,Y) 
is flipped between the two structures, it exists in one of them and does not exist 
in the other. Let us assume, without loss of generality, that it exists in G and 
does not exist in Gx,y ■ Therefore, according to Eq. (JSj) , the scores ax,y{G) and 
<Jx,y {Gx, y) °f the two structures G and Gx,y, respectively, evaluate to different 
cases: logPr«X -\LY\B X ) | D) for the case of G, and logPr((XU_Y \B X ) \ D) for 
the case of Gx,y ■ In contrast, for all other pair of variables (X, W), W £ V\ {X}, 
either the edge (X, W) exists in both structures G and Gx,y , or it does not exist 
in neither one. Therefore, their respective scores (Tx,y{G) and <tx,y(Gx,y) both 
evaluates to the same case, either logPr((X_|iLW^|Bx) | D) if the edge exists, or 
logPr((X^Y|Bx) | D) if it does not. The only difference between these scores 
is in the blanket of X, that contains Y only in the case of G. This justifies that 
the difference between both scores is expected to be much larger for the case of 
{X, Y} than for the case of {X, W}. 

Since the exact score of the selected neighbor G is computed in line [5l the 
only impact of the approximation in the hill-climbing search is the selection of 
a sub-optimal neighbor, which, although always climbs the search space, has the 
potential of producing a search termination before reaching a local maxima. Given 
the complexity of the problem, the impact of this approximation can only be 
assessed empirically. Later experiments show that despite this approximation, our 
approach still outperforms the state-of-the-art algorithms by reaching structures 
with higher quality. Moreover, Appendix[B]presents empirical measurements of the 
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complete landscape of the IB-score for several synthetic datasets, showing that in 
most cases, our structure selection strategy finds nearly optimal scores. 

At this point the only aspect that remains to discuss is the resulting compu- 
tational cost of the whole algorithm. The most expensive operation in the main 
loop becomes the computation of the (exact) IB-score of G' at line [5] with a cost 
of 0(n) when computed incrementally; resulting in an overall computational cost 
for the loop of 0{M ■ n), where M denotes the number of iterations until ter- 
mination. To this cost, it only remains to add the cost of computing the initial 
structure non- incrementally at line[TJ with a cost of 0(n 2 ). Therefore, the overall 
cost of the algorithm is 0(n 2 + Mn). Since M can be obtained only empirically, 
the experimental section show measurements of M on different scenarios, proving 
empirically that M is not a source of an extra degree in the complexity, because 
it grows at most linearly with n, resulting in an overall computational complexity 
ofO(n 2 ). 

4 Experimental results 

This section describes several experiments on synthetic and real datasets for test- 
ing empirically the robustness of our approach IBMAP, and the efficiency of our 
algorithm IBMAP-HC. We report a detailed and systematic experimental com- 
parison between IBMAP-HC and state-of-the art independence-based structure 
learning algorithms. For comparing all the algorithms on the same ground, we ran 
all of them using the Bayesian test as statistical independence test. 

We compare the quality of structures learned by our solution, against the 
quality of structures learned by GSMN [9], a state-of-the-art independence-based 
algorithm in terms of quality. We introduce also a competitor called HHC-MN, 
as an adaptation for learning the structure of Markov networks of the HHC algo- 
rithm [5], a state-of-the-art independence-based algorithm for learning Bayesian 
networks. 

The HHC algorithm learns the structure by learning the set of parents and 
children (PC) of each variable through the interleaved HITON-PC with symme- 
try correction algorithm [6l|l]. This is in fact possible for Bayesian networks, even 
though the Markov blanket of a variable is composed not only by the PC set, but 
also by the spouses of the variable (i.e., the other parents of its children). Inter- 
leaved HITON-PC executes at each iteration a step exponential in the size of the 
current estimate of the PC set. For the case of Markov networks, the equivalent 
of the PC of a variable is its neighbors, that is exactly its Markov Blanket. It is 
therefore expected that HITON-PC learns the Markov Blanket of a Markov net- 
work, and thus it can be used as part of HHC to learn the undirected structure. 
This fact is not proven analytically here, but confirmed empirically for all the cases 
considered in this section. To get a Markov network learning algorithm we then 
simply omit the final step of HHC that orients the edges, denoting the resulting 
algorithm by HHC-MN. As a final remark, we note that being the PC and Markov 
Blanket sets equivalent in Markov networks, the savings gained for Bayesian net- 
works are non-existent and thus HHC-MN is expected to scale to fewer variables 
than its Bayesian networks counterpart. 

The three following subsections describes our experiments with the competitor 
algorithms, over synthetic and real datasets. 
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4.1 Synthetic data experiments: random underlying structures 

A first set of experiments were conducted on artificial datasets, generated using a 
Gibbs sampler on randomly generated Markov networks (structure plus parame- 
ters) . This allows a systematic and controlled study, providing datasets with known 
underlying structures to allow the control of the complexity of the problem, and 
the ability to better asses the quality of structures obtained by each algorithm. 

To measure structural errors in the structures learned for each algorithm, we 
report the Hamming distance between the learned structure and the underlying 
one, i.e., the sum of false positive and false negative edges of the learned structure. 

Another quality measure that we use in this work for assessing the structures 
learned, is the well known F-measure, an harmonic mean of precision and recall 
quality measures, commonly used in the information retrieval community. Pre- 
cision indicates how good was the algorithm in learning correct independences 
(that is, the relation between the true independences that were found, over all 
independences found by the algorithm). Instead, recall indicates how good was 
the algorithm in learning independences, but over all the correct independences 
present in the real structure (that is, the relation between the correct indepen- 
dences that were found, over the total of independences existent in the underlying 
structure). Then, the F-measure is computed as follows: 



The synthetic random Markov networks were generated for domains of n G 
{75,100,200} binary variables. For each domain size, 10 random networks were 
generated for increasing connectivities r G {1,2,4,8}, by considering as edges the 
first tit/2 variable pairs of a random permutation of the set of all variable pairs. It 
is worth mentioning that with increasing values of r, it is increasingly difficult to 
learn the structure. Given these Markov networks, we report the quality of struc- 
tures learned by GSMN, HHC-MN, and IBMAP-HC using portions of each dataset 
with increasing number of datapoints D G {25,50,100,200,400,800,1600,3200}, 
for each (n, r) combination. 

The independence structure determines the factorization of the distribution 
into potential functions over subset of variables, one per clique in the structure. 
To determine a complete model we must determine the numerical parameters that 
quantify these potential functions. For the datasets generated to correctly and 
strongly represent the direct dependencies encoded by the edges, we considered 
in these experiments pair-wise cliques for the factorization of the models, that is, 
two- variable factors <j>(X, Y) for each edge in the random structure generated, and 
set the numerical parameters so that the correlation between them is strong. For 
that, we forced the parameters to result in a log-odds ratio of each pairwise factor 



£x,y = log ( 0(x=o'y=i)tft(x=i'y=o| ) to ^ e e(ma l to 1 f° r an edges (see [2]). This 



results in an equation over the values of the potential function with 4 unknowns. 
We therefore chose 3 parameters randomly in the range [0, 1], and solved for the 
remaining one. In our experiments we set e = 1.0. 

Figures [1] and [2] show the mean values and standard deviations over the ten 
repetitions, of the Hamming distances and F-measure of structures learned by the 
algorithms considered, respectively. The plots are ordered by columns for differ- 
ent n values, and by rows for different r values. The figures show clearly that 



F-measure = 



2 x precision x recall 



(10) 



precision + recall 
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Fig. 1 Mean and standard deviation over 10 repetitions of the Hamming distance of the mod- 
els learned by algorithms GSMN (black bars), HHC-MN (gray bars), and IBMAP-HC (light 
gray bars) for increasing sizes of random synthetic datasets, domain sizes n = 75 (left plot), 
n = 100 (center plot), and n = 200 (right plot), and r G {1, 2, 4, 8} in the rows. 



both, IBMAP-HC and HHC-MN learn structures with qualities significantly bet- 
ter (lower Hamming distance, or higher F-measure) than that of GSMN in all 
the cases. With respect to HHC-MN, the quality of the structures learned by 
IBMAP-HC are better or equal (up to statistical significance) in all the cases 
tested, except in the following specific cases: 

• t = 2,n > 75, D = 400, 

• T = 2,n = 200, D = 800, 

• T = 4,n > 75, D > 200. 

The best improvements are obtained for r = l,D G {1600,3200} for all n's, 
where IBMAP-HC results in no errors, while GSMN and HHC-MN still present 
errors. Also it is worth noting that in all the cases considered, IBMAP-HC out- 
performs against competitors significantly in all cases (n,r), when D < 100. 

Figure [3] shows the corresponding running times of the same experiment. Such 
results show clearly that both, IBMAP-HC and HHC-MN runtimes are lower than 
that of GSMN in all cases, except some cases where HHC-MN presents expensive 
running times, due to its exponential cost for high connectivities (the cases with 
t = 8, in the last row). 
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Fig. 2 Mean and standard deviation over 10 repetitions of the F-measure of the models 
learned by algorithms GSMN (black bars), HHC-MN (gray bars), and IBMAP-HC (light gray 
bars) for increasing dataset sizes of random synthetic datasets, domain sizes n = 75 (left plot), 
n = 100 (center plot), and n = 200 (right plot), and t £ {1,2,4, 8} in the rows. 



To conclude this section, we confirm empirically that IBMAP-HC achieves 
polynomial time complexities to the number of random variables in the domain. 
This is shown by Figure [3] that presents measurements of M (number of as- 
cents in the hill-climbing search) for increasing problem sizes n. Such results 
were obtained for datasets generated in the same way than the previous exper- 
iments. The figure shows the values of M for problems with increasing values 
of n e {4,12,16,20,24,30,50,75} in the X-axis, D = 1000, and a line for each 
t € {1,2,4,8}, indicating that M (Y-axis) grows linearly or slower. We omit re- 
sults for different D values, because they are similar. 

In summary, for synthetic datasets IBMAP-HC outperforms GSMN in quality 
in all cases, with equivalent runtimes, and outperforms HHC-MN in quality in 
most cases, with considerable improvements in runtime. 



4.2 Synthetic data experiments: Ising models 



A second set of experiments over synthetic datasets were conducted over a more 
interesting scenario, the Ising spin glasses models, that are mathematical models 
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Fig. 3 Mean and standard deviation over 10 repetitions of the runtimes required by al- 
gorithms GSMN (black bars), HHC-MN (gray bars), and IBMAP-HC (light gray bars) for 
increasing dataset sizes of random synthetic datasets, domain sizes n = 75 (left plot), n = 100 
(center plot), and n = 200 (right plot), and r G {1, 2, 4, 8} in the rows. 
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Fig. 4 Measurements in the number of ascents M (Y-axis) in the hill-climbing search of 
IBMAP-HC for increasing values of n (X-axis), and r 6 {1, 2, 4, 8}, D = 1000. 
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of ferromagnetism in statistical mechanics. Using such models as underlying struc- 
ture, ten datasets were generated for random Ising models with n € {75, 100, 200} 
binary variables. Figure [5] shows the results for ten different random repetitions. 
The graphs in such figure are ordered by columns for different n values, and show- 
ing in the first row the Hamming distance results, in the second row the F-measure 
results, and in the third row the corresponding runtimes. On the analysis of such 
results, we conclude they are similar to the case of random networks with r = 2, 
with IBMAP-HC outperforming GSMN and HHC-MN in all cases, in terms of 
Hamming distance, F-measure and runtimes. 
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Fig. 5 Mean and standard deviation over 10 repetitions of the Hamming distance (first 
row), F-measure (second row) and runtime (third row) of algorithms GSMN (black bars), 
HHC-MN (gray bars), and IBMAP-HC (light gray bars) for increasing datasctsizcs of Ising 
synthetic datasets, and domain sizes n = 75 (left plot), n = 100 (center plot), and n = 200 
(right plot). 



4.3 Benchmark datasets experiments 

In this section we show our experiments on benchmark (real-world) datasets. We 
used the publicly available benchmark datasets obtained from the UCI Reposito- 
ries of machine learning 1 and KDD datasets [15] , 

For benchmark datasets, since the underlying network is unknown, it is not 
possible to compute neither the Hamming distance nor the F-measure. Therefore, 
to measure the structure's quality we used a quantity called here accuracy, used for 
the same purpose in other related works [9,21,7 . The accuracy measure consists 
in a comparison of the outcome (true or false) of a number of tests performed on 
the structure learned by each algorithm using vertex separation, to the same tests 
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performed on the dataset. We define the accuracy as a normalized measure for 
counting the number of matches in a comparison of the independence assertions 
that holds in data and the structure learned. That is, if T denotes the set of all 
possible triplets over V, it is checked for how many triplets t £ T, t is independent 
(or dependent) in both the data, and the learned structure, and then normalized 
by T|. Unfortunately, the size of T is exponential, so we compute the approxi- 
mate accuracy over a randomly sampled subset T, uniformly distributed for each 
conditioning set cardinality. In our experiments we used \T\ = 100 x (™), i.e., a 
hundred triplets per conditioning set size. 

We conducted our experiment using 19 real- world domains, listed in Table [TJ 
column one. For each dataset D, we shuffled the data and then divided it into 
a training set for learning the structure (%75), and a test set for computing the 
accuracy (%25). The table also shows information about the number of attributes 
(second column), and the number of datapoints available in the train and test 
sets (third and fourth column). For each dataset we used the train set as input 
to the GSMN, HHC-MN, and IBMAP-HC algorithms, and the accuracy obtained 
for the structure learned for each algorithm is shown in the fifth, sixth and sev- 
enth columns, respectively. For each evaluation measure, the best performance is 
indicated in bold. Such results show that in 8 out of 17 datasets IBMAP-HC re- 
sulted in better accuracy, 6 cases resulted in ties (2 with GSMN, 1 with HHC-MN, 
and 3 with both), and for the remaining cases, the best results are obtained by 
HHC-MN(2 cases) and GSMN (1 case). 



Dataset 


n 


Train 
D 


Test 
D 


Accuracy 


GSMN 


HHC-MN 


IBMAP-HC 


machine 


10 


156 


52 


0.590 


0.567 


0.679 


lenses 


5 


17 


6 


0.881 


0.875 


0.881 


hepatitis 


20 


59 


20 


0.496 


0.633 


0.796 


hayes-roth 


6 


98 


33 


0.516 


0.516 


0.516 


crx 


16 


489 


163 


0.578 


0.593 


0.609 


cmc 


10 


1104 


368 


0.759 


0.711 


0.726 


car 


7 


1295 


432 


0.629 


0.641 


0.703 


bands 


38 


207 


69 


0.399 


0.408 


0.546 


baloons 


5 


14 


5 


0.950 


0.897 


0.950 


balance-scale 


5 


468 


156 


0.516 


0.516 


0.516 


nursery 


9 


9719 


3240 


0.392 


0.415 


0.649 


ecoli 


9 


251 


84 


0.523 


0.591 


0.694 


echocardiogram 


13 


45 


15 


0.696 


0.745 


0.745 


flag 


29 


145 


48 


0.446 


0.451 


0.803 


iris 


5 


112 


37 


0.695 


0.742 


0.736 


tic-tac-toe 


10 


718 


239 


0.671 


0.684 


0.498 


monks- 1 


7 


416 


139 


0.905 


0.905 


0.905 



Table 1 Accuracy for several real-world data sets. The structure is learned using a subsample 
called train set, and the accuracy is computed using the test set. For each evaluation measure, 
the best performance is indicated in bold. 
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5 IBMAP-HC for Estimation of Distribution Algorithms 

In contrast to benchmark datasets that comes from arbitrary applications, we 
present now results of evaluating IBMAP-HC in a real world application of knowledge- 
discovery: the Estimation of Distribution algorithms (ED As) [25,17. These are 
variations of the well-known evolutionary algorithms, that perform the same selec- 
tion and variation stages, but replace the crossover and mutation stages with the 
estimation and sampling in the task of generating a new population. The former 
stage estimate a probability distribution from the current population, generat- 
ing the next population by sampling from it (thus their name). In the estimation 
stage, EDAs estimate the probability distribution from the dataset corresponding 
to the current population. This is because they associate each gene to a random 
variable, each individual to a joint assignment of these variables, and the selected 
population to a sample of the distribution. The rationale for replacing crossover 
methods with estimation is that by estimating the distribution from the selected 
individuals, that is, those best fitted, the sampling stage would produce novel, yet 
well-fitted individuals. 

Several Markov networks based EDAs has been proposed recently that uses 
Markov networks for modeling the distribution [28,3,30,31 . As a test-bed we 
considered the Markovianity Optimization Algorithm (MO A) [31] . This is a state- 
of-the-art MN-based EDA that learns the Markov network structure from the 
population using an efficient structure learning algorithm based on mutual infor- 
mation (MI), a simple independence-based structure learning algorithm, described 
in detail in the same work, and designed specifically for MOA. The sampling in 
MOA is conducted through a variation of a Gibbs sampler that requires only the 
structure of the model, avoiding the need to learn the model parameters. The im- 
plementation of MI in MOA takes advantage of experts information indicating the 
maximum number of neighbor variables that a variable can have, denoted here k. 
We tested MI for different values of k (results not shown here), observing great 
sensitivity of MI to its value. Our algorithm IBMAP-HC does not use such pa- 
rameter. In the experiments below we set the value of k for MI to be the closest to 
the true value, resulting in the best possible performance of MI, i.e., the strongest 
competitor for IBMAP-HC. 



n 


MOA 


MOA' 


D* 


/* 


D* 


r 


15 


50 


267.50 (35.45) 


50 


202.50 (14.19) 


30 


200 


1170.00 (94.87) 


100 


475.00 (42.49) 


60 


800 


5200.00 (98.46) 


200 


1050.00 (52.70) 


90 


800 


5560.00 (126.49) 


400 


2220.00 (63.25) 


120 


1600 


11200.00 (871.53) 


800 


4400.00 (312.33) 



Table 2 Results of MOA and MOA' (that uses IBMAP-HC) for the OneMax problem, for 
increasing problem sizes (rows) in terms of critical population size D* , and mean and standard 
deviation over 10 repetitions of the number of fitness evaluations /* required to obtain the 
global optimum. 



We conducted experiments to compare IBMAP-HC as an alternative structure 
learning within MOA, denoted MOA', and denoting by MOA the original version 
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that uses MI. The thesis is that a better structure learning algorithm improves the 
convergence of MOA, that is, the optimum is reached computing fewer evaluations 
of the fitness of individuals. Both versions were tested on two benchmark functions 
widely used in the EDA's literature: Royal Road and OneMax, both bit-string 
optimization tasks, detailed in [24 . Each bit-string is modeled in the context of 
evolutionary algorithms as a chromosome and each bit as a gene. In the Royal Road 
problem, the variables are arranged in groups of size 7. Its goal is to maximize the 
number of Is in the string, but adding 7 to the fitness count only when a group 
has all Is, otherwise adding 0. For example, in the case of 7 = 4, an individual 
111110011111 is separated in the groups [1111] [1001] [1111], and only the first 
and third groups contributes 4 to the fitness count, which in the example equals 8. 
The underlying independence structure that should be learned therefore contains 
cliques of size 7, one per group. In our experiments we used 7 = 1 and 7 = 4. The 
former is known in the literature as OneMax. In the example, the fitness is 10 for 
OneMax. Clearly, the optimal individual for both problems is 111111111111. 



n 


MOA 


MOA' 


D* 


r 


D* 


/* 


16 


100 


545.00 (59.86) 


50 


337.50 (176.09) 


32 


400 


3800.00 (210.82) 


400 


2140.00 (134.99) 


64 


800 


9120.00 (252.98) 


800 


4440.00 (126.49) 


92 


1600 


18400.00 (533.33) 


800 


5080.00 (500.67) 


120 


1600 


31120.00 (822.31) 


1600 


9840.00 (386.44) 



Table 3 Results of MOA and MOA' (that uses IBMAP-HC) for the Royal Road problem, for 
increasing problem sizes (rows) in terms of critical population size D* , and mean and standard 
deviation over 10 repetitions of the number of fitness evaluations /* required to obtain the 
global optimum. 



In the experiments, MOA is iterated for 1000 generations or until the optima is 
reached, whatever happened first. For several runs differing in the initial (random) 
population, we measured the success rate as the fraction of times the optima is 
found. A commonly used measure of performance in ED As is the critical population 
size D*; the minimum population size for which the success rate is 100%. Smaller 
D* values have a double benefit over runtime: (i) fewer fitness evaluations for 
reaching the optima, and (ii) faster distribution estimation. We report D* and the 
number of fitness evaluations required for that population size, denoted /*. More 
robust algorithms are expected to require smaller D* and /* values. To measure 
D* in Royal Road and OneMax, each version of MOA was run 10 times for each 
of the population sizes D = {50, 100,200,400,800, 1600,3200}. Then, for that D* , 
we report the average and standard deviation of /* on each of those runs. In all 
the experiments, the population is truncated with a selection size of 50% and an 
elitism of 50%; used for preventing diversity loss. In MOA, the parameter k was 
set to 3 and 1 in Royal Road and OneMax, respectively. 

Results are presented in Table [2] for the OneMax problem, and Table [3] for 
the Royal Road problem. For both algorithms MOA and MOA', each table report 
values of D* , and the average and standard deviation (in parenthesis) of /*, for 
increasing problem sizes n = {15,30,60,90,120} for the OneMax problem, and 
n = {16,32,64,90, 120} for the Royal Road problem (domains multiple of 7 = 4 
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are required). In both tables, the results show that for /*, MOA' always outper- 
forms MOA; while for D* , it is always equal or lower. For Royal Road, the larger 
improvement is for n = 92 where MOA' requires 75% fewer fitness evaluations /* 
and D* is halved. For OneMax, the larger improvement is for n = 60 where MOA' 
requires 80% fewer fitness evaluations /* and D* is reduced to a quarter. 

An interpretation of these results is that IBMAP-HC estimates better the 
distribution. To confirm this hypothesis we compared the structures learned by the 
two algorithms over the same synthetic datasets considered in the previous section. 
For n = 75, D = 100, t = 2, the Hamming distances of MI and IBMAP-HC were 
132, and 75, respectively. For r = 4 they were 233 and 143, respectively; and for 
t = 8, 395 and 388, respectively. These results show clearly that the quality of 
IBMAP-HC indeed outperforms that of MI. Finally, we highlight that the efficiency 
of IBMAP-HC allowed it to be run in large problems up to 120 genes in size, 
estimating the structure over many generations. 

6 Conclusions and future work 

This paper proposes a novel independence-based, maximum-a-posteriori approach 
for learning the structure of Markov networks; and IBMAP-HC, an efficient instan- 
tiation of IBMAP. Our method follows an independence-based strategy for getting 
the MAP independences structure from data proposing an independence-based 
score. Experiments comparing IBMAP-HC against state-of-the-art independence- 
based algorithms indicate that our method improves in most cases over the independ 
based competitors with equivalent computational complexities. IBMAP-HC was 
also tested in a practical, challenging setting: Estimation of Distribution algo- 
rithms, resulting in faster convergence to the optimum than a state-of-the-art 
Markov network EDA algorithm, for the selected benchmark functions. According 
with our experimental results, and the conclusions of Appendix [Bj the effective- 
ness of our structure selection strategy is confirmed, and therefore we believe that 
it is worth guiding our future work in improving the IB-score as a measure of 
Pr(G | D), i.e., relaxing the independence assumption made in Equation ([3]), as 
well as exploring alternative closure sets. Also, it is clearly worthwhile considering 
testing our approach in more practical real world testbeds, potentially comparing 
its performance against state-of-the-art score-based algorithms, such as [14,27, 12, 

[33]. 
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A Completeness of Markov blanket closure 

This appendix presents Theorem [T] a formal proof that the Markov blanket closure described 
in Definition [2] of Section l3.1.1l is in fact a closure, i.e., its independence assertions completely 
determine the structure used to generate it. 

Let us start by reproducing some necessary theoretical results extracted from 16 18,26 : 
the pairwise Markov property, the Intersection property of conditional independence, and the 
Strong Union property of conditional independence, all satisfied by any Markov network G of 
a positive graph-isomorph distribution P: 

Definition 3 (Pairwise Markov property) Let G be a Markov network of some graph- 
isomorph distribution P, then 

(X,Y) £ E(G) & (XALY\V\{X,Y}) in P. (11) 

Definition 4 (Intersection) The conditional independences among random variables of a 
positive distribution P satisfy the Intersection property (expressed in counter-positive form): 

(X4i-Y\Z) A (X1LW\Z,Y) => (XALY\Z,W) (12) 
for all (X ^ Y ^ W) Z. 

Definition 5 (Strong Union) The conditional independences among random variables of 
a graph-isomorph distribution P satisfy the following Strong Union property of conditional 
independence: 

(XALY\Z) => (XALY\Z, W) (13) 

for all (X ^ Y) £ Z. 

We present now two auxiliary lemmas that relate independences with edges in the graph: 
Lemma 1 

(X_U_y|B x \{Y}> => (X,Y)£E(G). (14) 

Proof. The proof proceeds by first applying the Strong union property to the l.h.s. to 
obtain (X1LY\V \ {X, Y}), and then applying the pairwise property to conclude the r.h.s. 
(X,Y)iE(G). □ 

For the remaining of the proof we need to argue that something similar to the counter- 
positive of Lemma [Tl holds: 

Lemma 2 

{X4LY\B X \{Y}) A \/W £ B x {XALW\Z,Y) =>■ (X,Y) e E(G). (15) 

Proof. The proof proceeds by extending the conditioning set Bx\{l / } of the l.h.s. to the 
whole domain V\{X, Y}, to then apply the counter-positive of Eq. (1111) and reach the r.h.s. 
(X, Y) £ E(G). For that, we apply the intersection property of Eq. (1121 iteratively, by taking 
at each iteration the pair containing one of the independences in the l.h.s., and, in the first 
iteration the dependence in the l.h.s., and the following iterations the dependence resulting 
from applying intersection. In all cases, we take Z = Bx\{^}- Let see this process in detail. 
In the first iteration we take from the l.h.s. the dependence and the independence for the first 
W, obtaining, by intersection, the dependence (X^LY\Z,W). We can now take the resulting 
dependence, with the independence for the following W, denoted for convenience W . It seems 
that intersection can no longer be applied because the respective conditioning sets ZU{W} and 
ZU{y} does not match. However, by graph-isomorphism of P, we have that the Strong Union 
property of conditional independence is satisfied in P, and therefore any independence given 
some conditioning set follows from the same independence given a subset of this conditioning 
set, in particular then, we have that (^_LLW|Z, W, Y), and intersection can therefore be ap- 
plied, resulting in (X4LY\Z, W, W). Following this iteratively, we reach (X4LY\V \ {X, Y}), 
where the conditioning set is the result of Z = Bx\{Y} U Bx, recalling X £ Bj. 

□ 

We can now prove our main theorem: 
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Theorem 1 The Markov blanket closure of a structure G, as stated in Definition UH is a 
set of conditional independence assertions that are sufficient for completely determining the 
structure G of a positive graph-is omorph distribution. 

Proof. We prove the above theorem by proving that all the edges and no edges in G are 
determined by the assertions contained in C(G). We do it separately for absence and existence 
of edge between any two variables X and Y: 

i) For edge absence: Let (X, Y) £ E(G). Then, by definition, the closure contains the two 
independence assertions: {XALY \B X \{Y}) and (K_LLX|By \{X}>, which, by Eq. (H) of 
Lemma[T]both imply (X,Y) £ E(G). 

ii) For edge existence: 

Similarly, let (X,Y) S E(G). Then, by definition, the closure contains the dependence 
assertion: (X^LY \B X \{Y}). Also, for all W s.t. (X, W) £ E(G) (i.e., W £ B x ), the closure 
contains {XALW\B X )- Then, by Eq. JTS) of Lemma||]we have that (X, Y) G E{G). □ 
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B IBMAP landscape analysis 



In this appendix we report the results of an experiment that analyzes empirically the land- 
scape of the IB-score function on synthetic datasets. The experiment consists in an analysis 
of the surface of the IB-score over the complete search space of possible structures. The aim 
is to assess how good is the hill-climbing search for maximizing the IB-score. Due to the ex- 
ponential number of possible networks for each domain, in a first instance we explore how the 
complete landscape of IB-score looks like for datasets with a small domain size n = 6. For this 
experiment, we used synthetic datasets similar to those used in Section |4. II 

The plots in Figure \E\ show in the Y-axes the values of the IB-score for all the possible 
structures, and sort the structures in the X-axes, by its Hamming distance to the true under- 
lying structure in the dataset (this is, from zero, to Q))- Note that the scores of the structures 
appear in log probabilities, because they was computed as shown in Equation (0. With this 
layout, the structures in the left (near to zero) are those with less structural errors, and are 
also those expected to have a higher value of the IB-score. Therefore, the structures in the 
right are expected to have lower values of the IB-score. Also, indicated with a diamond, the 
structures found by the algorithm IBMAP-HC are shown for each case. 



D=10 



D=100 



D=1000 




Hamming distance 



ng distance 




-100 
-150 
-200 
-250 
-300 
-350 
-400 



-200 
-400 
-600 
-800 
-1000 
-1200 
-1400 
-1600 
-1800 
-2000 



Hamming distance 



-1000 
-2000 
-3000 
-4000 
-5000 
-6000 ■ 



Fig. 6 Complete landscape of the IB-score for synthetic datasets with n = 6, for increasing 
dataset sizes D = 10 (left column), D = 100 (center column), and n = 1000 (right column), 
and r £ {1, 2, 4, 8} in the rows. The X-axis sort the structures in the Hamming distance with 
the correct structure. The Y-axis shows the IB-score for all the structures in the landscape. 
The structure found by IBMAP-HC is indicated by a diamond. 
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The plots are ordered in the columns for increasing values of the dataset D £ {10, 100, 1000}, 
and in the rows, the different values of r £ {1, 2, 4, 8}, increasing the complexity of the problem. 
From the analysis of such plots, it is observed how the landscape shapes to a decreasing curve as 
increasing the value D (see the tendency from left to right columns) . This is achieved because 
the precision of the statistical tests improves with increasing D. In second place, the diamond 
that indicates the position in the landscape of the structure learned by the IBMAP-HC algo- 
rithm, achieves always the structure with highest score value. It can be also observed how the 
error of the structure learned by IBMAP-HC is closer to zero while increasing D. 
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Fig. 7 A fraction of the landscape of the IB-score for synthetic datasets with n = 20, for 
increasing dataset sizes D = 10 (left column), D = 100 (center column), and n = 1000 (right 
column), and r £ {1,2,4,8} in the rows. The X-axis sort the structures in the Hamming 
distance with the correct structure. The Y-axis shows the IB-score for all the structures in the 
landscape. The structure found by IBMAP-HC is indicated by a diamond. 



A second instance of this experiment was made for a domain size n = 20. In this instance, 

(20\ 

the landscape contains a total size of 2v 2 / . As it is impossible to show the IB-score for the 
complete landscape, we show only a subset obtained by generating randomly k = 5 structures 
deferring in m edges with the true structure, with m from to ( 2 2 °) in the X-axis. Such results 
are shown in Figure [7] From the analysis of such plots, the same conclusions are observed. 

To conclude this appendix, it is worth noting that our results confirm the effectiveness of 
our structure selection strategy in maximizing the IB-score over the complete landscape. For 
that reason, we conclude that it is worth guiding our future work only in the improvement of 
the IB-score as a measure of Pr(G | D). 
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